Urban Greenspace and Asthma Prevalence in Columbus, Ohio in 2023¶
Max Warnock
2/3/26
Introduction¶
There are many studies pointing to the value of urban green space in promoting human physical and mental health. Benefits of urban green space include offsetting greenhouse gas emissions, reducing urban heat island effects, collecting stormwater, and improving air quality. Urban green spaces can also provide increased opportunities for outdoor physical activity and social interaction, which can improve mental health (Lee et al.).
For this project, I will be studying the relationship between urban green space and asthma rates in Columbus, Ohio. According to the Ohio Department of Health, Columbus has some of the highest asthma rates in the Midwest. Vegetation metrics that quantify the structure and connectivity of greenspace have been linked to improved human health, as shown in studies by Tsai et al. and Liu et al. To quantify green space, I will be analyzing the vegetation metrics of mean patch size, edge density, and fragmentation.
In the notebook below, I will use Python code to calculate patch, edge, and fragmentation statistics for urban greenspace in Columbus, OH. These statistics should reflect the connectivity and spread of urban greenspace, which are important for ecosystem function and access. I will then use a linear model to identify statistically significant correlations between the distribution of greenspace and health data compiled by the U.S. Centers for Disease Control and Prevention (CDC).
Data¶
Centers for Disease Control (CDC) Places Data.¶
I will be using this data source for asthma data by census tract in Columbus, OH in 2023. This dataset also provides modified census tracts clipped to the city boundary which will allow me to map the data more easily.
National Agricultural Imagery Program (NAIP) Data.¶
For vegetation data, I will be using NAIP imagery, which has a very high resolution of 1 m. This resolution will allow me to analyze the structure of the greenspace, rather than simply quantify it, which could be done with coarser Landsat (30 m) or MODIS (250 m) data. Since this is a very large dataset, I will be using a cloud-based solution (Microsoft Planetary Computer) to handle the data download.
Works Cited:
Lee AC, Jordan HC, Horsley J. Value of urban green spaces in promoting healthy living and wellbeing: prospects for planning. Risk Manag Healthc Policy. 2015 Aug 27;8:131-7. doi: 10.2147/RMHP.S61654. PMID: 26347082; PMCID: PMC4556255.
Tsai, Wei-Lun, Yu-Fai Leung, Melissa R. McHale, Myron F. Floyd, and Brian J. Reich. 2019. “Relationships Between Urban Green Land Cover and Human Health at Different Spatial Resolutions.” Urban Ecosystems 22 (2): 315–24. https://doi.org/10.1007/s11252-018-0813-3.
Liu, Dong, Mei-Po Kwan, and Zihan Kan. 2021. "Analysis of Urban Green Space Accessibility and Distribution Inequity in the City of Chicago." Urban Forestry & Urban Greening. https://doi.org/10.1016/j.ufug.2021.127029
Ohio Department of Health. Data and Statistics. https://odh.ohio.gov/know-our-programs/asthma-program/data-and-statistics
Step 1. Setting up the analysis¶
# Install sodapy if you haven't already
# !pip install sodapy
### Import libraries
### File paths and organization
import os
import pathlib
### See Warnings
import warnings
### Different data types
import pandas as pd
import geopandas as gpd
import numpy as np
import xarray as xr
import rioxarray as rxr
import rioxarray.merge as rxrmerge
import shapely
import time
### plotting and visualization packages
import geoviews as gv
import holoviews as hv
import hvplot.pandas
import hvplot.xarray
from cartopy import crs as ccrs
### data exploration and visualization
import matplotlib
import scipy.stats as stats
import matplotlib.pyplot as plt
from bokeh.models import FixedTicker
### image processing
from scipy.ndimage import convolve
from scipy.ndimage import label
### working with data in the cloud
import pystac_client
from sodapy import Socrata
### progress bar
from tqdm.notebook import tqdm
### OLS sklearn
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
Create reproducible file paths within your home directory
### Create reproducible file paths
data_dir = os.path.join(
### home directory
pathlib.Path.home(),
### Earth analytics folder
"earth-analytics","data",
### Project directory
"columbus-asthma",
)
### make the directory
os.makedirs(data_dir,exist_ok=True)
Prevent code disruptions from a momentary connection lapse. This is very important for the long process of downloading NAIP data.
### Prevent GDAL from quitting due to momentary disruptions
os.environ["GDAL_HTTP_MAX_RETRY"] = "5"
os.environ["GDAL_HTTP_RETRY_DELAY"] = "1"
Step 2. Create a Site Map¶
Create a quick visualization of the city of Columbus census tracts (from the CDC Places dataset) overlaid on basic Esri satellite imagery. Since Columbus, OH is not the only U.S. city named Columbus, I had to select it using its designated city code.
### Set up the census tract path
### Define directory
tract_dir = os.path.join(data_dir,"columbus-tract")
os.makedirs(tract_dir,exist_ok=True)
### make path to directory
tract_path = os.path.join(tract_dir,"columbus_tract.shp")
### Load in census tract data
if not os.path.exists(tract_path):
### put in census url
tract_url = ("https://data.cdc.gov/download/x7zy-2xmx/application%2Fzip")
### read the shape file
tract_gdf = gpd.read_file(tract_url)
### subset to Columbus Ohio
columbus_tract_gdf = tract_gdf[(tract_gdf["PlaceName"] == "Columbus city") & (tract_gdf["ST"] == "39")]
### save it as a file
columbus_tract_gdf.to_file(tract_path,index=False)
### Load in the census tract data
columbus_tract_gdf = gpd.read_file(tract_path)
columbus_tract_gdf
### Site plot -- Census tracts with satellite imagery in the background
(
columbus_tract_gdf
.to_crs(ccrs.Mercator())
### plot with satellite imagery in the background
.hvplot(
line_color = 'orange', fill_color = None,
crs = ccrs.Mercator(), tiles = 'EsriImagery',
frame_width = 600, title = "Columbus, OH Census Tracts"
)
)
Figure 1: From the visualization above, it's clear that Columbus has a mix of highly developed areas, green areas, and some rivers and waterways. Additionally, the surrounding area appears to be farmland.
STEP 3 - Access Asthma and Urban Greenspaces Data¶
Step 3a. In the step below, I will download the asthma data from the CDC Places dataset for 2023.
### Set up a path for the asthma data
cdc_path = os.path.join(data_dir, 'asthma.csv')
### Download asthma data (only once)
if not os.path.exists(cdc_path):
cdc_url = (
### url for data
"https://data.cdc.gov/resource/cwsq-ngmh.csv"
### parameters to filter on (Franklin is the county for Columbus)
"?year=2023"
"&stateabbr=OH"
"&countyname=Franklin"
"&measureid=CASTHMA"
"&$limit=1500"
)
cdc_df = (
### open it as a csv
pd.read_csv(cdc_url)
### rename the cols
.rename(columns = {
'data_value': 'asthma',
'low_confidence_limit': 'asthma_ci_low',
'high_confidence_limit': 'asthma_ci_high',
'locationname': 'tract'})
### select columns I want to keep
[[
'year', 'tract',
'asthma', 'asthma_ci_low', 'asthma_ci_high',
'data_value_unit', 'totalpopulation',
'totalpop18plus'
]]
)
else:
cdc_df = pd.read_csv(cdc_path)
### Save asthma data to file
cdc_df.to_csv(cdc_path, index = False)
### Preview asthma data
cdc_df
| year | tract | asthma | asthma_ci_low | asthma_ci_high | data_value_unit | totalpopulation | totalpop18plus | |
|---|---|---|---|---|---|---|---|---|
| 0 | 2023 | 39049009397 | 12.8 | 11.4 | 14.1 | % | 2977 | 2418 |
| 1 | 2023 | 39049006237 | 9.7 | 8.7 | 10.8 | % | 5953 | 5051 |
| 2 | 2023 | 39049007114 | 12.0 | 10.8 | 13.3 | % | 5130 | 3943 |
| 3 | 2023 | 39049010204 | 12.4 | 11.2 | 13.8 | % | 6684 | 4922 |
| 4 | 2023 | 39049007961 | 10.0 | 8.9 | 11.1 | % | 4974 | 3401 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 322 | 2023 | 39049006392 | 9.4 | 8.5 | 10.5 | % | 4644 | 3485 |
| 323 | 2023 | 39049006302 | 9.9 | 8.9 | 11.0 | % | 4613 | 3674 |
| 324 | 2023 | 39049000600 | 11.1 | 9.9 | 12.5 | % | 3839 | 3492 |
| 325 | 2023 | 39049006923 | 11.5 | 10.3 | 12.8 | % | 3957 | 2866 |
| 326 | 2023 | 39049009373 | 13.1 | 11.8 | 14.5 | % | 6184 | 4241 |
327 rows × 8 columns
Step 3b. Join CDC asthma data with census tract boundaries and plot the result.
### Change tract identifier datatype for merging
columbus_tract_gdf.tract2010 = columbus_tract_gdf.tract2010.astype('int64')
### Merge census data with geometry
tract_cdc_gdf = (
columbus_tract_gdf
.merge(cdc_df,
### specify cols to merge on
left_on = 'tract2010',
right_on = 'tract',
### use inner join
how = 'inner'
)
)
# tract_cdc_gdf
# Numeric ticks to keep on the colorbar
numeric_ticks = [10, 11, 12, 13, 14, 15]
# markers
name_ticks = [10, 15]
# Combine them
all_ticks = numeric_ticks + name_ticks
(
### load basemap with satellite imagery
gv.tile_sources.EsriImagery
*
### map census tracts with asthma data
gv.Polygons(
### reproject to align CRS
tract_cdc_gdf.to_crs(ccrs.Mercator()),
### select cols
vdims=['asthma', 'tract2010'],
### ensure CRSs align
crs=ccrs.Mercator()
).opts(
color='asthma',
colorbar=True,
tools=['hover'],
title="Columbus, OH Asthma Rates in 2023",
clabel='Asthma rate',
colorbar_opts={
'ticker': FixedTicker(ticks=all_ticks),
'major_label_overrides': {
10: 'Lower asthma rate',
15: 'Higher asthma rate'
}
}
)
).opts(width=600, height=600, xaxis=None, yaxis=None)
Figure 2: This map shows the CDC Places data for asthma rates in Columbus. Darker blues represent higher asthma rates than lighter colors. Unfortunately, some census tracts do not have asthma data; they are represented as clear on the map. However, there is still data for most of the city, so I think it's okay to proceed with the analysis. There are some interesting spatial trends. For example, the northwest side of the city appears to have significantly lower asthma rates than the central and southern parts of the city. The map also generally agrees with my previous research that Columbus has high rates of asthma (Ohio Department of Health).
Step 3c. Get URLs for urban greenspace imagery.
Below, I'm connecting to the Microsoft Planetary Computer SpatioTemporal Access Catalog (STAC).
### Connect to the planetary computer catalog
e84_catalog = pystac_client.Client.open(
"https://planetarycomputer.microsoft.com/api/stac/v1"
)
The loop below collects URLs of NAIP imagery for each tract that we are studying in Columbus. It will not collect data for census tracts that we do not have asthma data for.
This loop goes through every tract value in tract_latlong_gdf. For each tract, the loop extracts the tract id, checks if the tract has been downloaded yet, searches for NAIP imagery from the STAC, builds a dataframe with the tract ID, date, and URL, and has built in backups in case there are internet disruptions. After the loop has finished, the dataframes are concatenated into a complete object (scene_df) that contains all the tracts, dates, and urls.
### Convert geometry to lat/lon for STAC
tract_latlong_gdf = tract_cdc_gdf.to_crs(4326)
### Define a path to save NDVI stats
ndvi_stats_path = os.path.join(data_dir, "columbus-ndvi-stats.csv")
### Check for existing data - do not access duplicate tracts
### initialize an empty list for the census tract IDs
downloaded_tracts = []
### write the conditional
if os.path.exists(ndvi_stats_path):
### if it exists, open the csvs in the path
ndvi_stats_dr = pd.read_csv(ndvi_stats_path)
### fill the list with the tract values
downloaded_tracts = ndvi_stats_dr.tract.values
### if it doesn't exist, print statement
else:
print("No census tracts downloaded so far")
### Loop through each census tract
### list to accumulate into
scene_dfs = []
### loop through each tract value in the gdf
for _, tract_values in tqdm(tract_latlong_gdf.iterrows()):
### extract the tract id
tract = tract_values.tract2010
### Check if statistics are already downloaded for this tract
if tract not in downloaded_tracts:
### Repeat up to 5 times in case of a momentary disruption
i = 0 ### initialize retry counter
retry_limit = 5 ### max number of tries
while i < retry_limit:
### Try accessing the STAC
try:
### Search for tiles
naip_search = e84_catalog.search(
collections = ['naip'],
intersects = shapely.to_geojson(tract_values.geometry),
datetime="2021"
)
### Build dataframe with tracts and tile urls
scene_dfs.append(pd.DataFrame(dict(
### column called tract
tract = tract,
date = [pd.to_datetime(scene.datetime).date()
for scene in naip_search.items()],
rgbir_href = [scene.assets['image'].href for scene in naip_search.items()],
)))
### exit retry loop once STAC request succeeds
break
### Try again in case of an APIError
except pystac_client.exceptions.APIError:
print(
f'Could not connect with STAC server. '
f'Retrying tract {tract}...')
### wait 2 seconds before retrying
time.sleep(2)
i += 1
continue
### Concatenate the url dataframes
if scene_dfs:
### concatenate
scene_df = pd.concat(scene_dfs).reset_index(drop = True)
### otherwise
else:
scene_df = None
### Preview the url dataframe
scene_df
# Save the url dataframe to a csv (skip if no new tracts were searched)
if scene_df is not None:
    scene_df.to_csv(ndvi_stats_path, index=False)
Step 3d. Compute NDVI Statistics
Here are the three different vegetation statistics that we will be calculating:
- Percent vegetation: The percentage of pixels that exceed a vegetation threshold (0.12 is common with Landsat).
- Mean patch size: The average size of patches, or contiguous areas exceeding the vegetation threshold. Patches can be identified with the label function from scipy.ndimage.
- Edge density: The proportion of edge pixels among vegetated pixels. Edges can be identified by convolving the image with a kernel designed to detect pixels that are different from their surroundings.
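These three statistics can be sketched on a small, made-up NDVI array (the values below are hypothetical, purely for illustration):

```python
import numpy as np
from scipy.ndimage import label

# Hypothetical 5x5 NDVI array (made-up values for illustration only)
ndvi = np.array([
    [0.8, 0.7, 0.0, 0.0, 0.6],
    [0.9, 0.6, 0.0, 0.0, 0.5],
    [0.0, 0.0, 0.0, 0.0, 0.0],
    [0.4, 0.0, 0.0, 0.7, 0.8],
    [0.5, 0.0, 0.0, 0.9, 0.6],
])

# Vegetation mask: pixels above the NDVI threshold
veg_mask = ndvi > 0.3

# Percent vegetation: vegetated pixels / total pixels
frac_veg = veg_mask.sum() / veg_mask.size

# Mean patch size: label contiguous vegetated regions, average their sizes
labeled, num_patches = label(veg_mask)
patch_sizes = np.bincount(labeled.ravel())[1:]  # drop the background count
mean_patch_size = patch_sizes.mean()

print(num_patches, frac_veg, mean_patch_size)  # 4 patches, 48% vegetated
```

Note that label uses 4-connectivity by default, so pixels touching only diagonally count as separate patches.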
### make csv for ndvi data
ndvi_stats_path_veg = os.path.join(data_dir, 'columbus-ndvi-stats-veg.csv')
Loop to compute NDVI statistics¶
An important goal of this loop is to compute NDVI statistics per census tract. Therefore, we need to clip the NAIP imagery to each tract. The below loop goes through each census tract of interest and clips NAIP imagery to the geometry of that census tract. Then, we compute NDVI statistics for the clipped NAIP imagery.
Mean Patch Size: From the NDVI values, we create a vegetation mask by selecting all pixels above a threshold (here, 0.3 NDVI). We use the label function to group contiguous above-threshold pixels into patches, and we can then average the patch sizes to get the mean patch size for the tract.
Edge Density: To identify edges, we apply a kernel to the vegetation mask to determine which vegetation pixels are edge pixels. The kernel is applied to all pixels in the tract (this is sometimes referred to as a moving window or focal statistics). Because we are using a kernel, the outermost edge pixels of the tract will be lost.
Below is the 3x3 kernel that I will be using. For each pixel, the convolution multiplies the surrounding 3x3 neighborhood by the kernel and sums the result.
$$ \text{Kernel} = \begin{bmatrix} 1 & 1 & 1 \\ 1 & -8 & 1 \\ 1 & 1 & 1 \end{bmatrix} $$
Kernel results: If the center pixel is the same as the surrounding pixels, its value in the final matrix will be $-8 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 = 0$. If it is higher than the surroundings, the result will be negative, and if it is lower than the surroundings, the result will be positive. As such, the edge pixels of our patches will be negative.
Kernel and explanation of how it works credit: CU Boulder Earth Lab
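To see the kernel logic in action, here is a minimal sketch (with a hypothetical 5x5 mask containing one solid 3x3 vegetation patch, not the project data):

```python
import numpy as np
from scipy.ndimage import convolve

# Hypothetical mask: a single 3x3 vegetated patch on bare ground
veg_mask = np.zeros((5, 5), dtype=int)
veg_mask[1:4, 1:4] = 1

kernel = np.array([
    [1,  1, 1],
    [1, -8, 1],
    [1,  1, 1]])

# mode='constant' pads outside the array with zeros (treated as non-vegetation)
edges = convolve(veg_mask, kernel, mode='constant')

print(edges[2, 2])  # patch interior, same as all 8 neighbors -> 0
print(edges[1, 1])  # patch corner, higher than 5 of its neighbors -> -5
```

As described above, interior pixels come out as 0 and the patch's edge pixels come out negative.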
Percent Vegetation: We can calculate percent vegetation simply by dividing the number of vegetated pixels (the sum of veg_mask) by the total number of valid pixels in the tract.
This loop runs through all these steps to calculate mean patch size, edge density, and percent vegetation for each tract, appending each tract's statistics as a row to a csv file. This is a long process and the loop may take several hours.
### Skip this step if data are already downloaded
if scene_df is not None:
### Loop through the census tracts with URLs
for tract, tract_date_gdf in tqdm(scene_df.groupby('tract')):
### Open all images for tract
### create a list to store NDVI arrays, with 1 array per NAIP image
tile_das = []
### iterate over rows in tract specific dataframe
for _, href_s in tract_date_gdf.iterrows():
### Open vsi connection to data
tile_da = rxr.open_rasterio(
### mask and squeeze
href_s.rgbir_href, masked = True).squeeze()
### Clip data
boundary = (
tract_cdc_gdf
.set_index('tract2010')
.loc[[tract]]
.to_crs(tile_da.rio.crs)
.geometry
)
### Crop to bounding box
crop_da = tile_da.rio.clip_box(
*boundary.envelope.total_bounds,
auto_expand = True
)
### clip to actual tract geometry
clip_da = crop_da.rio.clip(boundary, all_touched = True)
### Compute NDVI
ndvi_da = (
(clip_da.sel(band = 4) - clip_da.sel(band = 1))
/ (clip_da.sel(band = 4) + clip_da.sel(band = 1)))
### Accumulate result
tile_das.append(ndvi_da)
### Merge data
scene_da = rxrmerge.merge_arrays(tile_das)
### Mask vegetation
veg_mask = (scene_da > 0.3)
### Calculate statistics and save data to file
### count all valid pixels in the tract
total_pixels = scene_da.notnull().sum()
### count all vegetated pixels
veg_pixels = veg_mask.sum()
### identify vegetation patches
labeled_patches, num_patches = label(veg_mask)
### count patch pixels
patch_sizes = np.bincount(labeled_patches.ravel())[1:]
### mean patch size
mean_patch_size = patch_sizes.mean()
### Calculate edge density
### define the kernel
kernel = np.array([
[1, 1, 1],
[1, -8, 1],
[1, 1, 1]])
### detect boundaries between veg and non-veg
edges = convolve(veg_mask, kernel, mode='constant')
### edge pixels as a fraction of all pixels in the tract
edge_density = np.sum(edges != 0) / veg_mask.size
### Add a row to the statistics file for this tract
### create new dataframe for the tract and append it to the csv
pd.DataFrame(dict(
### store tract ID
tract = [tract],
### store total pixels
total_pixels = [int(total_pixels)],
### fraction of veg pixels
frac_veg = [float(veg_pixels / total_pixels)],
### store the mean patch size
mean_patch_size = [mean_patch_size],
### edge density
edge_density = [edge_density]
### write out as a csv
)).to_csv(
ndvi_stats_path_veg,
### append mode
mode = 'a',
### get rid of row numbers
index = False,
### write column names
header = (not os.path.exists(ndvi_stats_path_veg))
)
### check the file
ndvi_stats_df = pd.read_csv(ndvi_stats_path_veg)
ndvi_stats_df
| tract | total_pixels | frac_veg | mean_patch_size | edge_density | |
|---|---|---|---|---|---|
| 0 | 39049000110 | 5458912 | 0.467575 | 148.579661 | 0.189672 |
| 1 | 39049000120 | 8008243 | 0.516840 | 163.790344 | 0.125395 |
| 2 | 39049000210 | 4766663 | 0.532251 | 206.634631 | 0.179721 |
| 3 | 39049000220 | 5410607 | 0.385892 | 134.296520 | 0.114323 |
| 4 | 39049000310 | 6862911 | 0.210889 | 87.108697 | 0.157271 |
| ... | ... | ... | ... | ... | ... |
| 199 | 39049010100 | 17816122 | 0.260141 | 80.683443 | 0.144937 |
| 200 | 39049010300 | 6793594 | 0.062171 | 31.383935 | 0.043643 |
| 201 | 39049010601 | 2797819 | 0.475241 | 243.612862 | 0.107467 |
| 202 | 39049010602 | 2386501 | 0.497720 | 211.128688 | 0.093800 |
| 203 | 39049010700 | 4169917 | 0.544232 | 239.894609 | 0.169301 |
204 rows × 5 columns
STEP 4 - Explore the data with plots¶
First, we need to merge all our data together into one geodataframe that we can use for plotting. See below.
### Merge census data with geometry
columbus_ndvi_cdc_gdf = (
### grab census tract gdf
tract_cdc_gdf
### merge with ndvi df
.merge(
ndvi_stats_df,
### match on the tract ID
### left: tract2010
### right: tract
left_on = 'tract2010',
right_on = 'tract',
how = 'inner'
)
)
### check it out
columbus_ndvi_cdc_gdf
| place2010 | tract2010 | ST | PlaceName | plctract10 | PlcTrPop10 | geometry | year | tract_x | asthma | asthma_ci_low | asthma_ci_high | data_value_unit | totalpopulation | totalpop18plus | tract_y | total_pixels | frac_veg | mean_patch_size | edge_density | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3918000 | 39049000110 | 39 | Columbus | 3918000-39049000110 | 3344 | POLYGON ((-9241687.822 4875035.426, -9241688.4... | 2023 | 39049000110 | 10.3 | 9.1 | 11.4 | % | 3489 | 3015 | 39049000110 | 5458912 | 0.467575 | 148.579661 | 0.189672 |
| 1 | 3918000 | 39049000120 | 39 | Columbus | 3918000-39049000120 | 3162 | POLYGON ((-9239407.998 4873815.462, -9239390.7... | 2023 | 39049000120 | 10.0 | 8.9 | 11.1 | % | 3220 | 2716 | 39049000120 | 8008243 | 0.516840 | 163.790344 | 0.125395 |
| 2 | 3918000 | 39049000210 | 39 | Columbus | 3918000-39049000210 | 2935 | POLYGON ((-9239360.353 4872960.063, -9239346.3... | 2023 | 39049000210 | 10.2 | 9.1 | 11.4 | % | 3049 | 2512 | 39049000210 | 4766663 | 0.532251 | 206.634631 | 0.179721 |
| 3 | 3918000 | 39049000220 | 39 | Columbus | 3918000-39049000220 | 3727 | POLYGON ((-9239346.327 4872707.908, -9239330.8... | 2023 | 39049000220 | 10.0 | 8.9 | 11.1 | % | 4186 | 3398 | 39049000220 | 5410607 | 0.385892 | 134.296520 | 0.114323 |
| 4 | 3918000 | 39049000310 | 39 | Columbus | 3918000-39049000310 | 3119 | POLYGON ((-9239195.377 4872909.315, -9239210.7... | 2023 | 39049000310 | 11.6 | 10.4 | 13.0 | % | 3377 | 2650 | 39049000310 | 6862911 | 0.210889 | 87.108697 | 0.157271 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 199 | 3918000 | 39049007922 | 39 | Columbus | 3918000-39049007922 | 0 | POLYGON ((-9254031.153 4874089.169, -9254045.3... | 2023 | 39049007922 | 10.3 | 9.2 | 11.5 | % | 6201 | 4821 | 39049007922 | 46829 | 0.352816 | 58.381625 | 0.038817 |
| 200 | 3918000 | 39049007931 | 39 | Columbus | 3918000-39049007931 | 0 | MULTIPOLYGON (((-9252946.678 4868317.011, -925... | 2023 | 39049007931 | 10.5 | 9.3 | 11.7 | % | 4305 | 3181 | 39049007931 | 3097110 | 0.028144 | 142.893443 | 0.012675 |
| 201 | 3918000 | 39049008230 | 39 | Columbus | 3918000-39049008230 | 0 | POLYGON ((-9252589.564 4857148.339, -9252620.5... | 2023 | 39049008230 | 14.3 | 12.9 | 15.9 | % | 3178 | 1792 | 39049008230 | 138983 | 0.105373 | 143.578431 | 0.066807 |
| 202 | 3918000 | 39049008500 | 39 | Columbus | 3918000-39049008500 | 0 | POLYGON ((-9241942.521 4861874.191, -9241967.7... | 2023 | 39049008500 | 10.1 | 9.0 | 11.4 | % | 5702 | 4784 | 39049008500 | 2643186 | 0.119766 | 34.042908 | 0.061626 |
| 203 | 3918000 | 39049009386 | 39 | Columbus | 3918000-39049009386 | 0 | MULTIPOLYGON (((-9218532.58 4856069.647, -9218... | 2023 | 39049009386 | 13.4 | 12.1 | 14.8 | % | 3279 | 2263 | 39049009386 | 2486908 | 0.011225 | 20.346939 | 0.014952 |
204 rows × 20 columns
Next, we will make a function for plotting maps as choropleths. Since we are plotting a lot of maps, using a function is more efficient.
### Plot chloropleths with vegetation statistics
### make a plot_chloropleth function
def plot_chloropleth(gdf, **opts):
### docstring to describe the function
"""Generate a chloropleth with the given color column"""
### use geoviews to make a polygon object
return gv.Polygons(
### use mercator
gdf.to_crs(ccrs.Mercator()),
### set plot crs to mercator
crs = ccrs.Mercator()
### drop x and y axis and add a legend
).opts(xaxis = None, yaxis = None, colorbar = True, **opts)
Apply the function to Columbus asthma and vegetation data
### apply the function to Columbus asthma and vegetation data
(
plot_chloropleth(
### plot asthma by census tract
columbus_ndvi_cdc_gdf,
color = 'asthma',
cmap = 'viridis',
title = 'Asthma Rates in 2023',
clabel='Low asthma rate High asthma rate',)
+
### add a second plot with vegetation edge density
plot_chloropleth(
columbus_ndvi_cdc_gdf,
color = 'edge_density',
cmap = 'Greens',
title = 'Vegetation Edge Density in 2021',
clabel='Low edge den High edge den')
+
### add a third plot with fraction of vegetation per tract
plot_chloropleth(
columbus_ndvi_cdc_gdf,
color = 'frac_veg',
cmap = 'Greens',
title = 'Percent Vegetation in 2021',
clabel='Low % vegetation High % vegetation')
+
### add a second plot with vegetation mean patch size
plot_chloropleth(
columbus_ndvi_cdc_gdf,
color = 'mean_patch_size',
cmap = 'Greens',
title = 'Mean Vegetation Patch Size (MPS) in 2021',
clabel='Low MPS High MPS')
).cols(2)
### add 2 columns to plot the maps in a 2x2 grid
Figure 3: Choropleth maps showing asthma rates alongside the other vegetation metrics that we calculated above.
STEP 5: Exploring a linear ordinary least-squares regression¶
To identify whether there is a statistically significant relationship between asthma prevalence and greenspace metrics, we can run a linear ordinary least squares (OLS) regression and measure how well it predicts asthma from the given vegetation metrics.
OLS has 5 main assumptions:
- The relationship between x and y is linear.
- Errors are normally distributed.
- Independence: when there are multiple x variables, collinearity needs to be avoided so that an effect is not attributed to the wrong variable. (Worth watching here, since the greenspace metrics may be correlated with one another.)
- Stationarity: The parameters of the model do not vary over time.
- Complete observations: there can be no missing ("No Data") values.
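As a quick numeric check of the independence assumption, one could compute the correlation between predictors. This sketch uses synthetic data; the variable names are hypothetical stand-ins, not the notebook's columns:

```python
import numpy as np

rng = np.random.default_rng(7)
# Hypothetical predictor, and a second predictor built to depend on it
frac_veg = rng.uniform(0, 1, size=200)
edge_density = 0.3 * frac_veg + rng.normal(0, 0.02, size=200)

# Pearson correlation between the two predictors
corr = np.corrcoef(frac_veg, edge_density)[0, 1]
print(corr)  # close to 1 -> these two predictors are collinear
```

A correlation near 1 or -1 between predictors would suggest dropping one of them before fitting OLS.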
Model Objective:
The objective of this model is to determine whether there is a statistically significant relationship between asthma rates and greenspace values. For this assignment, we are comparing asthma rates to three different vegetation metrics: mean patch size, edge density (fragmentation), and percent vegetation (frac_veg). We need to plot the data for each of these metrics to see if they fit the assumptions of OLS. In particular, we want to look for a linear relationship between x and y, and a normal distribution.
Advantages and potential problems:
If greenspace and asthma rates are in fact correlated, this is a good model for finding that correlation. Potential problems with this model are that these are big datasets, which may have issues like uneven distributions, and that variables other than greenspace (such as other environmental factors, economics, demographics, and access to healthcare) could affect the data. The OLS model also does not account for spatial autocorrelation.
### Visualize distribution of data
### select variables
variables = ['frac_veg','asthma','mean_patch_size','edge_density']
hvplot.scatter_matrix(
columbus_ndvi_cdc_gdf
[variables]
)
Figure 4: Histograms and scatter plots comparing all variables with each other in preparation for fitting an OLS model. The histograms show fairly normal distributions for both edge density and percent vegetation (frac_veg). However, asthma and mean patch size are somewhat skewed, so they are candidates for log transformation. Additionally, in the scatter plots, some of the strongest correlation appears between edge density and frac_veg.
### histograms (another way of making them)
### make facet
fig, axes = plt.subplots(2, 2, figsize = (12, 10))
### list variables to plot
variables = ['frac_veg','asthma','mean_patch_size','edge_density']
titles = ['Vegetated fraction','Adult asthma rate','Mean patch size','Edge density']
### loop through variables and axes to plot histograms
for i, (var, title) in enumerate(zip(variables, titles)):
ax = axes[i//2, i%2]
ax.hist(columbus_ndvi_cdc_gdf[var], bins = 10, color = 'gray', edgecolor = 'black')
ax.set_title(title)
ax.set_xlabel(var)
ax.set_ylabel('Frequency')
### adjust layers to prevent overlap
plt.tight_layout()
plt.show()
Figure 5: Demonstration of how to plot the variables simply as histograms (same results).
Drop missing values as these could skew the model.
### Drop missing observations
model_df = (
columbus_ndvi_cdc_gdf
### make a copy
.copy()
### subset to columns
[['frac_veg','asthma','mean_patch_size','edge_density','geometry']]
### drop rows (observations) with missing values
.dropna()
)
model_df
| frac_veg | asthma | mean_patch_size | edge_density | geometry | |
|---|---|---|---|---|---|
| 0 | 0.467575 | 10.3 | 148.579661 | 0.189672 | POLYGON ((-9241687.822 4875035.426, -9241688.4... |
| 1 | 0.516840 | 10.0 | 163.790344 | 0.125395 | POLYGON ((-9239407.998 4873815.462, -9239390.7... |
| 2 | 0.532251 | 10.2 | 206.634631 | 0.179721 | POLYGON ((-9239360.353 4872960.063, -9239346.3... |
| 3 | 0.385892 | 10.0 | 134.296520 | 0.114323 | POLYGON ((-9239346.327 4872707.908, -9239330.8... |
| 4 | 0.210889 | 11.6 | 87.108697 | 0.157271 | POLYGON ((-9239195.377 4872909.315, -9239210.7... |
| ... | ... | ... | ... | ... | ... |
| 199 | 0.352816 | 10.3 | 58.381625 | 0.038817 | POLYGON ((-9254031.153 4874089.169, -9254045.3... |
| 200 | 0.028144 | 10.5 | 142.893443 | 0.012675 | MULTIPOLYGON (((-9252946.678 4868317.011, -925... |
| 201 | 0.105373 | 14.3 | 143.578431 | 0.066807 | POLYGON ((-9252589.564 4857148.339, -9252620.5... |
| 202 | 0.119766 | 10.1 | 34.042908 | 0.061626 | POLYGON ((-9241942.521 4861874.191, -9241967.7... |
| 203 | 0.011225 | 13.4 | 20.346939 | 0.014952 | MULTIPOLYGON (((-9218532.58 4856069.647, -9218... |
204 rows × 5 columns
Below I perform a log transformation of the asthma and mean patch size variables, since both were somewhat skewed and a normal distribution is better suited to an OLS model.
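As a quick sanity check on why the log transform helps (using synthetic data, not the project data): a lognormal sample is strongly right-skewed, while its log is symmetric by construction.

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(42)
# Synthetic right-skewed sample, loosely shaped like the asthma rates
raw = rng.lognormal(mean=2.4, sigma=0.3, size=500)

raw_skew = skew(raw)          # clearly positive (right-skewed)
log_skew = skew(np.log(raw))  # much closer to zero (approximately normal)
print(raw_skew, log_skew)
```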
### Perform variable transformation
### log of asthma rate (to try to make it more normally distributed)
model_df['log_asthma'] = np.log(model_df.asthma)
### log of patch size
model_df['log_patch'] = np.log(model_df.mean_patch_size)
model_df
| frac_veg | asthma | mean_patch_size | edge_density | geometry | log_asthma | log_edge_den | log_patch | |
|---|---|---|---|---|---|---|---|---|
| 0 | 0.467575 | 10.3 | 148.579661 | 0.189672 | POLYGON ((-9241687.822 4875035.426, -9241688.4... | 2.332144 | -1.662459 | 5.001121 |
| 1 | 0.516840 | 10.0 | 163.790344 | 0.125395 | POLYGON ((-9239407.998 4873815.462, -9239390.7... | 2.302585 | -2.076290 | 5.098587 |
| 2 | 0.532251 | 10.2 | 206.634631 | 0.179721 | POLYGON ((-9239360.353 4872960.063, -9239346.3... | 2.322388 | -1.716350 | 5.330952 |
| 3 | 0.385892 | 10.0 | 134.296520 | 0.114323 | POLYGON ((-9239346.327 4872707.908, -9239330.8... | 2.302585 | -2.168725 | 4.900050 |
| 4 | 0.210889 | 11.6 | 87.108697 | 0.157271 | POLYGON ((-9239195.377 4872909.315, -9239210.7... | 2.451005 | -1.849786 | 4.467157 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 199 | 0.352816 | 10.3 | 58.381625 | 0.038817 | POLYGON ((-9254031.153 4874089.169, -9254045.3... | 2.332144 | -3.248902 | 4.067001 |
| 200 | 0.028144 | 10.5 | 142.893443 | 0.012675 | MULTIPOLYGON (((-9252946.678 4868317.011, -925... | 2.351375 | -4.368087 | 4.962099 |
| 201 | 0.105373 | 14.3 | 143.578431 | 0.066807 | POLYGON ((-9252589.564 4857148.339, -9252620.5... | 2.660260 | -2.705946 | 4.966881 |
| 202 | 0.119766 | 10.1 | 34.042908 | 0.061626 | POLYGON ((-9241942.521 4861874.191, -9241967.7... | 2.312535 | -2.786673 | 3.527622 |
| 203 | 0.011225 | 13.4 | 20.346939 | 0.014952 | MULTIPOLYGON (((-9218532.58 4856069.647, -9218... | 2.595255 | -4.202937 | 3.012930 |
204 rows × 8 columns
### Visualize transformed variables
hvplot.scatter_matrix(
model_df
[[
'frac_veg',
'edge_density',
'log_patch',
'log_asthma'
]]
)
Figure 6: Results after performing the log transformation on the asthma and mean patch size variables. All the histograms look much more normally distributed now.
To verify our visual interpretation of the histograms above, we can use Q-Q plots, which make departures from normality easier to see.
### q-q plots
### set variables
var_qq = ['frac_veg', 'edge_density', 'log_patch', 'log_asthma']
### make q-q plot for each variables
plt.figure(figsize = (12, 10))
for i, var in enumerate(var_qq, 1):
### make 2x2 facet
plt.subplot(2, 2, i)
### norm distribution
stats.probplot(model_df[var], dist = "norm", plot = plt)
### add title
plt.title(f'Q-Q plot for {var}')
### plot it
plt.tight_layout()
plt.show()
Figure 7: Q-Q plots showing the distribution of each variable of interest. In every panel the points fall close to the reference line, meaning all four variables are approximately normally distributed.
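Q-Q plots are a visual check; as a complementary numeric check, one could run a Shapiro-Wilk test. This sketch uses synthetic samples; in the notebook you would pass the model_df columns instead.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Synthetic stand-ins: one roughly normal sample, one strongly skewed sample
normal_sample = rng.normal(loc=2.4, scale=0.1, size=200)
skewed_sample = rng.lognormal(mean=0.0, sigma=1.0, size=200)

w_norm, p_norm = stats.shapiro(normal_sample)
w_skew, p_skew = stats.shapiro(skewed_sample)

# A small p-value indicates a departure from normality
print(p_norm, p_skew)
```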
### Select predictor and outcome variables
X = model_df[['frac_veg','log_patch']]
y = model_df[['log_asthma']]
### Split into training and testing datasets
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size = 0.33, random_state = 19
)
### Fit a linear regression
reg = LinearRegression()
### fit it with our data
reg.fit(X_train, y_train)
### Predict asthma values for the test dataset
y_test['pred_asthma'] = np.exp(reg.predict(X_test))
### real asthma rate for comparison
y_test['asthma'] = np.exp(y_test.log_asthma)
### check it out
y_test
| log_asthma | pred_asthma | asthma | |
|---|---|---|---|
| 62 | 2.525729 | 12.327742 | 12.5 |
| 129 | 2.564949 | 12.150397 | 13.0 |
| 152 | 2.406945 | 11.284053 | 11.1 |
| 94 | 2.442347 | 11.493972 | 11.5 |
| 188 | 2.332144 | 11.680175 | 10.3 |
| ... | ... | ... | ... |
| 61 | 2.322388 | 11.973013 | 10.2 |
| 144 | 2.360854 | 11.339591 | 10.6 |
| 201 | 2.660260 | 10.678226 | 14.3 |
| 131 | 2.468100 | 11.574451 | 11.8 |
| 36 | 2.517696 | 11.901622 | 12.4 |
68 rows × 3 columns
### Plot measured vs. predicted asthma prevalence with a 1-to-1 line
y_max = y_test.asthma.max()
(
y_test
.hvplot.scatter(
x='asthma',
y='pred_asthma',
xlabel = "Measured adult asthma prevalence",
ylabel = "Predicted adult asthma prevalence",
title = "Linear regression performance - testing data")
.opts(aspect='equal',
xlim=(0, y_max), ylim=(0, y_max),
width=600, height=600)
) * hv.Slope(slope=1, y_intercept=0).opts(color='black')
### Compute model error for all census tracts
### use the trained model to predict asthma prevalence for each census tract
model_df['pred_asthma'] = np.exp(reg.predict(X))
### calculate model error
model_df['error_asthma'] = model_df['pred_asthma'] - model_df['asthma']
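To summarize model error with single numbers, one could also compute RMSE and R² on the test set. This is a sketch with tiny made-up values; in the notebook you would pass y_test['asthma'] and y_test['pred_asthma'] instead.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical measured and predicted asthma prevalence values
measured = np.array([10.0, 12.0, 14.0])
predicted = np.array([11.0, 12.0, 13.0])

rmse = np.sqrt(mean_squared_error(measured, predicted))  # typical error size
r2 = r2_score(measured, predicted)  # fraction of variance explained
print(rmse, r2)
```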
### Plot error geographically as a chloropleth
(
plot_chloropleth(
model_df,
color='error_asthma',
cmap='RdBu',
title='OLS Model Error for Predicting Asthma Rates in Columbus, OH',
clabel='Error value',
colorbar_opts={
'ticker': FixedTicker(ticks=[-5, -1, 3]),
'major_label_overrides': {
-5: 'Under-predicted',
-1: 'Close fit',
3: 'Over-predicted'
}
}
)
.opts(frame_width=600, aspect='equal')
)